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Abstract 

We  consider  the  problem  of  model  selection  and  accounting  for  model  uncertainty 
in  high-dimensional  contingency  tables,  motivated  by  expert  system  applications.  The 
approach  most  used  currently  is  a  stepwise  strategy  guided  by  tests  based  on  approxi 
mate  asymptotic  P-values  leading  to  the  selection  of  a  single  model;  inference  is  then 
conditional  on  the  selected  model.  The  sampling  properties  of  such  a  strategy  are 
complex,  and  the  failure  to  take  account  of  model  uncertainty  leads  to  underestima 
tion  of  uncertainty  about  quantities  of  interest.  In  principle,  a  panacea  is  provided 
by  the  standard  Bayesian  formalism  which  averages  the  posterior  distributions  of  the 
quantity  of  interest  under  each  of  the  models,  weighted  by  their  posterior  model  prob¬ 
abilities.  However,  this  has  not  been  used  in  practice  because  computing  the  posterior 
model  probabilities  is  hard  and  the  number  of  models  is  verv  large  (often  greater  than 
10u). 

We  argue  that  the  standard  Bayesian  formalism  is  unsatisfactory  and  we  propose 
an  alternative  Bayesian  approach  that,  we  contend,  takes  full  account  of  the  true  model 
uncertainty  by  averaging  over  a  much  smaller  set  of  models.  An  efficient  search  algo¬ 
rithm  is  developed  for  finding  these  models.  We  consider  two  classes  of  models  that 
arise  in  expert  systems:  the  recursive  causal  models  and  the  decomposable  log-linear 
models.  For  each  of  these,  we  develop  efficient  ways  of  computing  exact  Bayes  factors 
and  hence  posterior  model  probabilities.  For  the  decomposable  log-linear  models,  this 
is  based  on  properties  of  chordal  graphs  and  hyper  Markov  prior  distributions  and  the 
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resultant  calculations  can  be  carried  out  locally.  The  end  product  is  an  overall  viral 
egy  for  model  selection  and  accounting  for  model  uncertainty  that  searches  efficiently 
through  the  very  large  classes  of  models  involved. 

Three  examples  are  given.  The  first  two  concern  data  sets  w  hich  h«v.-  been  analysed 
by  several  authors  in  the  context  of  model  selection.  The  third  addresses  a  urological 
diagnostic  problem. 

KEYWORDS:  Chordal  graph;  Contingency  table;  Decomposable  log  linear  model.  Expert 
system;  Hyper  Markov  distribution;  Recursive*  causal  model 
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1  Introduction 

Fruitful  approaches  to  inference  in  high-dimensional  contingency  tables  all  involve  choosing 
a  broad  class  of  models  to  be  considered  and  then  comparing  them  on  the  basis  of  how  well 
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they  predict  the  data.  Typically,  the  model  classes  are  huge  and  inference  in  the  presence  of 
the  many  competing  models  is  not  easy. 

Here  we  consider  two  classes  of  models:  the  recursive  causal  models  of  Knveri  tt  al 
(1984)  and  the  decomposable  log-linear  models  introduced  by  Goodman  (1970)  and  Haber 
man  (1974).  This  work  is  motivated  by  applications  in  expert  systems  which  use  a  belief 
network  to  represent  knowledge  and  perform  inference.  Lauritzen  and  Spiegelhalter  (1988) 
h«*.'c  described  a  method  tor  constructing  such  belief  networks.  A  recursive  causal  model  is 
elicited  from  an  “expert"  in  the  form  of  an  acyclic  directed  graph  with  nodes  representing 
random  variables  and  directed  links  representing  conditional  dependence  assumptions;  we 
will  refer  to  this  as  the  “qualitative  layer"  of  the  model.  Next,  the  joint  distribution  of  the 
random  variables  being  modeled  is  elicited;  this  is  the  “quantitative  layer".  Finally,  through 
a  series  of  graphical  and  numerical  operations,  the  recursive  causal  model  is  converted  to  a 
decomposable,  and  hence  graphical,  log-linear  model.  We  assume  throughout  that  all  links 
in  the  graph  corresponding  to  the  recursive  causal  model  are  directed.  It  is  important  that 
the  required  probabilities  be  elicited  in  the  directed  recursive  framework,  because  elicitation 
of  joint  distributions  in  the  undirected  decomposable  framework  is  usually  not  feasible. 

Potentially  the  most  important  advantage  of  constructing  expert  systems  in  this  f«*>iiion 
is  the  system’s  ability  to  modify  itself  as  data  becomes  available.  In  a  series  of  recent  papers. 
Spiegelhalter  and  Lauritzen  (1990a,  1990b),  Dawid  and  Lauritzen  (1989)  and  Spiegelhalter 
and  Cowell  (1991)  have  addressed  the  issue  of  updating  the  quantitative  layer  of  the  model. 
Building  on  this  work,  we  address  the  issue  of  updating  the  qualitative  layer — how  can 
the  graphical  structure  itself  be  updated  as  data  becomes  available?  If  we  succeed  in  our 
objective  we  will  have  truly  constructed  an  expert  system  which  can  learn. 

Currently,  the  most  used  approach  to  model  selection  in  contingency  tables  is  a  stepwise 
one,  adapted  from  stepwise  regression  by  Goodman  (1971);  see  also  Bishop,  Fienberg  and 
Holland  (1975,  Section  4.5  and  Chapter  9).  This  consists  of  sequentially  adding  and  deleting 
terms  on  the  basis  of  approximate  asymptotic  likelihood  ratio  tests,  leading  to  the  selection 
of  a  single  model.  Inference  about  the  quantities  of  interest  is  then  made  conditionally  on 
the  selected  model. 

There  are  several  difficulties  with  this  approach.  The  sampling  properties  of  the  overall 
strategy  are  complex  because  it  involves  multiple  tests  and,  at  least  implicitly,  the  compar¬ 
ison  of  non-nested  models  (Fenech  and  Westfall,  1988).  The  use  of  P- values  themselves  is 
controversial,  even  when  there  are  only  two  models  to  be  compared,  because  of  the  so-called 
“conflict  between  P- values  and  evidence”  discussed  by  Berger  and  Seilke  (1987)  and  Berger 
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and  Delampady  (1987).  One  aspect  of  this  is  that  tests  based  on  P  values  tend  to  reject 
even  apparently  satisfactory  models  when  the  sample  size  is  large;  a  dramatic  example  of 
this  w'as  discussed  b\  Raftery  (198Gb).  On  the  other  hand,  when  the  sample  size  is  small 
and  the  table  sparse,  the  asymptotic  approximations  on  which  the  ^-values  are  based  lend 
to  break  down. 

Perhaps  most  fundamentally,  conditioning  on  a  single  selected  model  ignores  model  un¬ 
certainty  and  so  leads  to  underestimation  of  the  uncertainty  about  the  quantities  of  interest. 
This  underestimation  can  be  large,  as  was  shown  bv  Regal  and  Hook  (1991 1  in  the  contin¬ 
gency  table  context  and  by  Miller  (1984)  in  the  regression  context.  One  bad  consequence  is 
that  it  can  lead  to  decisions  that  are  too  risky  (Hodges,  1987). 

In  principle,  the  standard  Bayesian  formalism  pio.ides  a  panacea  for  all  these  difficulties. 
If  .A  is  the  quantity  of  interest,  such  as  a  structural  characteristic  of  the  system  being  studied, 
a  future  observation,  or  the  utility  of  a  course  of  action,  then  its  posterior  distribution  given 
data  D  is 

K 

pr(A  |  D)  =  £pr(A  |  MkJ))pv(Mk  \  D).  (1) 

k-i 

This  is  an  average  of  the  posterior  distributions  under  each  of  the  models,  weighted  by  their 
posterior  model  probabilities.  In  equation  ( 1),  A/, . \1K  are  the  models  considered  and 


pr (iV/fc  |  D)  = 


Pr(  D  I  A/*)pr(A/t) 
£*lPr(D|  A/,)pr(A/,)’ 


(2) 


where 

pr (D  |  A/*)  =  J  pr (D  |  6k,Mk) pr(0*  j  Mk)dOk  (3) 

is  the  marginal  likelihood  of  model  A/*,  9k  is  the  (vector)  parameter  of  A/*,  pr(0k  |  A fk)  is  the 
prior  distribution  of  0k,  pr (D  )  9k,  Mk)  is  the  likelihood,  and  pr(A/*)  is  the  prior  probability 
of  Mk. 

However,  this  approach  has  not  been  adopted  in  practice.  This  appears  to  be  because 
(a)  the  posterior  model  probabilities  pr(A/*  |  D)  are  hard  to  compute  since  they  involve  the 
very  high-dimensional  integrals  in  equation  (3),  and  (b)  the  number  of  models  in  the  sum  in 
equation  (1)  is  huge.  For  example,  with  just  10  variables  (small  by  expert  system  standards) 
there  are  approximately  4.2  x  1018  recursiv  causal  models  and  1.9  x  J0n  decomposable 
models. 

One  might  hope  that  most  of  the  posterior  probability  would  be  accounted  for  by  a  small 
number  of  models  so  that  the  sum  in  equation  (1)  would  be  well  approximated  by  a  small 
number  of  terms.  Unfortunately,  this  does  not  typically  appear  to  be  the  case  because, 
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although  a  small  number  of  models  do  have  much  higher  posterior  probabilities  than  all 
the  others,  the  very  many  models  with  small  posterior  probabilities  contribute  substantially 
to  the  sum.  For  example,  Moulton  (1991)  reported  a  regression  example  with  21*  =  109b 
models  where  about  £00  models  were  needed  to  account  for  90%  of  the  posterior  probability. 

We  argue  that  the  standard  Bayesian  formalism  of  equation  (1)  is  flawed.  Adopting 
standard  methods  of  scientific  investigation,  we  show  that  accounting  for  the  true  model 
uncertainty  involves  averaging  over  a  much  smaller  set  of  models.  We  develop  simple  and 
efficient  ways  of  computing  the  posterior  model  probabilities  for  the  two  model  classes  consio 
ered.  Our  approach  is  to  take  advantage  of  the  graphical  structure  to  calculate  the  required 
probabilities  very  quickly,  while  representing  prior  opinion  in  an  easily  elicitable  form.  We 
also  describe  an  efficient  algorithm  for  searching  the  very  large  model  space. 

Putting  all  this  together  gves  us  a  simple  and  computationally  efficient  way  of  select¬ 
ing  the  best  models  and  accounting  for  model  uncertainty  in  recursive  causal  models  and 
decomposable  log-linear  models.  To  demonstrate  the  generality  of  our  approach,  our  dis¬ 
cussion  will  be  in  the  context  of  conventional  statistical  model  selection  rather  than  expert 
systems,  although  we  will  return  to  some  expert  system  specific  issues  in  Section  5.  In  Sec¬ 
tion  2  we  describe  the  principles  underlying  our  approach  to  model  selection.  In  Section  3 
we  apply  those  principles  to  the  recursive  causal  models,  while  in  Section  1  we  consider  the 
decomposable  models. 


2  Model  Selection  Strategy 

2.1  General  Principles  and  Occam’s  Razor 

We  argue  that  equation  (1)  does  not  accurately  represent  model  uncertainty.  Science  is  an 
iterative  process  in  which  competing  models  of  reality  are  compared  cn  the  basis  of  how  well 
they  predict  what  is  observed;  models  that  predict  much  less  well  than  their  competitors  are 
discarded.  Most  of  the  models  in  equation  (1)  have  been  discredited  in  the  sense  that  they 
predict  the  data  far  less  well  than  the  best  models  and  so  they  should  be  discarded;  there  is 
no  uncertainty  about  this  in  any  real  sense.  Hence  they  should  not  be  included  in  equation 
(!)• 

In  our  approach,  if  a  model  predicts  the  data  far  less  well  than  the  best  model  in  the 
class  it  will  be  discarded,  so  that  initially  we  exclude  from  equation  (1)  those  models  not 
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belonging  to  the  set 


A'  =  Mk  : 


max/{pr(A/i  |  /))} 


,c} 


pr(A/*  |  D) 

for  some  constant  C .  The  value  of  C  used  will  depend  on  the  context.  In  our  examples  we 
used  C  —  20,  by  analogy  with  the  popular  .05  cutoff  for  P- values;  Jeffreys  (1961,  Appendix 
B)  would  suggest  some  number  between  10  and  100,  while  Evett  (1991)  suggests  a  value  of 
1000  for  forensic  evidence  in  criminal  cases. 

Next  we  appeal  to  one  of  the  most  widely  accepted  norms  of  scientific  investigation, 
namely  Occam’s  razor.  Let  E  represent  the  evidence  and  pr(//|£)  the  probability  of  a 
specified  hypothesis  //  given  the  evidence  E.  Occam’s  razor  states  that  if: 

pr(//,|E)  =  pr(//2|£:)  =  ...  =  pr(//*|£) 

for  hypotheses  Ek,...,Hk,  then  the  simplest  among  Hi . Hk  is  to  be  preferred  (Kotz  and 

Johnson,  1985). 

Thus  we  also  exclude  from  equation  (1)  models  belonging  to  the  set 


B  =  \\lk  :  3M,  €  A,  M,  C  A/t,  >  1 

l  pr(Af*  |  D) 


(5) 


and  equation  (1)  is  replaced  by 


pr(A  !  D)  = 


where 


Pr( A  |  A/*,P)pr(£>  |  A/t)pr(A/t) 
Pr(£  i  A/fc)pr(A/*) 


A  =  A'\B. 


(6) 


(7) 


This  greatly  reduces  the  number  of  models  in  the  sum  in  equation  ( 1 )  and  hence  simplifies 
the  model  uncertainty  problem  a  great  deal.  Note  that  our  argument  is  not  an  approximation 
adopted  for  computational  convenience,  but  rather  an  exact  solution  based  on  the  way  science 
works.  Note  also  that  our  approach  in  equation  (6)  will  not  necessarily  give  an  answer  close 
to  that  given  by  equation  (1)  because,  due  to  the  very  large  number  of  models  in  the  class, 
the  models  discarded  may  have  a  large  total  posterior  probability  Pr(A/*  |  £),  even 

though  each  individual  model  discarded  has  a  very  small  posterior  probability. 

The  problem  thus  reduces  to  finding  the  set  A ,  and  we  now  outline  a  computational 
strategy  for  doing  this. 
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Figure  1:  Model  Selection  Strategy  -  A  Simple  Example 


2.2  Model  Selection  Strategy 

Our  approach  to  model  selection  is  a  variant  of  the  greedy-search  algorithm.  The  essentials  of 
the  approach  are  the  same  for  the  recursive  causal  models  and  the  decomposable  models  and 
could  readily  be  applied  to  more  general  graphical  models.  Posterior  model  probabilities  are 
used  as  a  metric  to  guide  the  search.  The  strategy  proceeds  out  into  model  space  away  from 
the  opening  set  of  models,  comparing  models  via  ratios  of  posterior  model  probabilities  in  a 
series  of  nested  comparisons.  The  extent  of  the  search  is  easily  controlled  and  will  depen^  on 
the  resources  available  for  specific  applications.  In  what  follows,  A/0  will  denote  the  smaller 
of  the  two  models  being  compared  and  M\  will  denote  the  larger.  In  fact,  A/0  and  A/j  will 
differ  by  just  one  link  throughout.  We  now  describe  the  elements  of  our  approach. 

Edwards  and  Havranek  (1985)  proposed  a  model  search  procedure  which  is  based  on  the 
following  rules: 

•  If  a  model  is  rejected,  then  all  its  submodels  are  rejected. 

•  If  a  model  is  accepted,  then  all  models  that  include  it  are  considered  accepted.  (We 
use  the  term  accepted  in  place  of  the  more  correct  non- rejected.) 

These  rules  were  first  suggested  by  Gabriel  (1969).  He  coined  the  term  coherence  for  testing 
procedures  satisfying  these  rules.  However,  Gabriel’s  arguments  are  predicated  on  the  use 
of  a  monotone  metric  for  model  testing.  A  test  statistic  Z(M)  is  said  to  be  monotone  if 
A/,  C  Mj  =>  Z(Mi)  >  Z(M}).  Likelihood  ratio  test  statistics  are  monotone  but  posterior 
model  probabilities  clearly  are  not.  We  now  argue  that,  in  this  context,  the  second  rule  is 
inappropriate. 
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Consider  the  (undirected)  example  in  Figure  1.  Suppose  that  we  start  with  the  saturated 
model  [ABC]  of  Figure  1(a),  and  that  when  we  compare  it  with  the  model  of  conditional 
independence  [,4C)[Z?C]  of  Figure  1(b),  we  reject  the  smaller  model  decisively.  1  hen  we 
axe  precisely  rejecting  the  conditional  independence  of  A  and  B  given  C .  This  conditional 
independence  also  holds  in  all  the  submodels  of  \AC\\BC\  and  so  we  reject  all  of  those  as 
well,  including  the  model  [A][BC]  of  Figure  1(c).  Thus,  if  we  reject  a  model,  we  reject  all 
its  submodels,  which  is  the  first  of  the  two  rules  of  Edwards  and  Havranek  (1985)  above. 

Now,  working  in  the  opposite  direction,  suppose  that  we  start  with  the  model  [.4][//C] 
of  Figure  1(c)  Comparing  it  with  [AC][fiCj  let  us  suppose  that  we  decisively  reject  the 
smaller  model.  We  are  precisely  rejecting  the  marginal  independence  of  A  in  favour  of  the 
conditional  independence  of  A  and  B  given  C .  However,  we  have  learned  nothing  about  the 
model  [ABC].  Indeed,  its  probability  could  be  even  lower  than  that  of  [,4][j9Cj!  It  follows 
that  the  second  rule  of  Edwards  and  Havranek  is  inappropriate  in  the  present  context. 

2.3  Occam’s  Window 

A  crucial  aspect  of  the  strategy  concerns  the  interpretation  of  the  ratio  of  posterior  model 
probabilities  when  comparing  two  models.  Again  we  appeal  to  Occam's  razor  which,  trans¬ 
lated  into  the  language  of  model  fitting,  we  implement  as  follows: 

•  If  the  log  posterior  odds  is  positive,  i.e.,  the  data  provides  evidence  for  the  smaller 
model,  then  we  reject  A/t  and  consider  A/o.  We  could  generalize  this  by  requiring  the 
log  posterior  odds  to  be  greater  than  some  positive  rnnc+ant  Oo  before  rejecting  M\. 

•  If  the  log  posterior  odds  is  small  and  negative,  providing  evidence  against  the  smaller 
model  which  is  not  “very  strong”  (Jeffreys,  1961),  then  we  consider  both  models. 

•  If  the  log  posterior  odds  is  large  and  negative,  i.e.,  smaller  than  Ol  ~  —  log(C)  where 
C  is  defined  by  equation  (7),  we  reject  A/o  and  consider  M\. 

Thus  there  are  three  possible  actions  following  each  comparison— see  Figure  2. 

Now  that  the  various  elements  of  the  strategy  are  in  place,  we  outline  the  search  tech¬ 
nique.  The  search  can  proceed  in  two  directions:  “Up”  from  each  starting  model  by  adding 
links,  or  “Down”  from  each  starting  model  by  dropping  links.  When  starting  from  a  non- 
saturated,  non-empty  model,  we  first  execute  the  “Down”  algorithm.  Then  we  execute  the 
“Up”  algorithm,  using  the  models  from  the  “Down”  algorithm  as  a  stalling  point.  Experi¬ 
ence  to  date  suggests  that  the  ordering  of  these  operations  has  little  impact  on  the  final  set  of 
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Strong  Evidence  for  M\ 


Figure  2:  Occam’s  Window:  Interpreting  the  log  posterior  odds 


models.  Let  A  and  C  be  subsets  of  model  space  At,  where  A  denotes  the  set  of  “acceptable" 
models  and  C  denotes  the  models  under  consideration.  For  both  algorithms,  we  begin  with 
A  =  0  and  C  =set  of  starting  models. 


BGMS-Lovvn  Algorithm 

1.  Select  a  model  M  from  C 

2.  C  *—  C  —  M  and  A  *—  .4  -r  \I 

3.  Select  a  submodel  Mo  of  M  by  removing  a  link  from  M 

4.  Compute  B  =  log 

5.  If  B  >  Or  then  A  A  —  M  and  if  M0  £  C,C  *—  C  +  M0 


6.  If  Oi  <  B  <  Or  then  if  M0  &  C,C  <—  C  +  M0 

7.  If  there  are  more  submodels  of  M,  go  to  3 

8.  If  C  ^  0,  go  to  1 
BGMS-Up  Algorithm 

1 .  Select  a  model  M  from  C  - — ... 

2.  C  * —  C  —  M  and  A  * —  A  ■(•  M 

3.  Select  a  supermodel  M\  of  M  by  adding  a  link  to  M 
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4.  Compute  B  =  log  gffgj 

5.  If  B  <  0L  then  A  *-  A  -  M  and  if  A/,  £  C.C  -  C  +  A/, 

6.  If  Ol  <  ^  <  Or  then  if  A/,  £  C,C  -  C  +  A/, 

7.  If  there  are  more  supermodels  of  A/,  go  to  3 
S.  If  C  ^  0,  go  to  l 

Upon  termination,  ,4  contains  the  set  of  potentially  acceptable  models.  Finally,  .e 
remove  all  the  models  which  satisfy  equation  (5),  where  1  is  replaced  by  exp(0«).  and  those 
models  Mk  for  which 

max/{pr(  A//  |  D)}  , 

pr(A/fc  |  D)  >  '•  ( 

A  now  contains  the  acceptable  models. 

3  The  Directed  Case — Recursive  Causal  Model  Se¬ 
lection 

3.1  Implementation 

Implementation  for  the  recursive  causal  models  proceeds  in  a  straightforward  fashion.  Con¬ 
sider  a  recursive  causal  model  for  a  set  of  random  variables  f  V.  The  model  is 

represented  by  a  directed  graph  where  each  variable  in  V  is  represented  by  a  node  in  the 
graph.  For  each  variable  v  G  V  we  define  pa(r)  to  he  the  set  of  parent  nodes  of  r.  i.e.  nodes 
iv  for  which  there  exists  a  directed  link  from  m  to  v.  The  assumptions  of  the  model  implv 
that  the  joint  distribution  of  Xv,v  G  V'.  which  we  denote  pr(V'),  is  given  by 

pr(V)  =  n  Pr(c|pa(i’)). 
t'€V 

In  early  implementations,  pr(ujpa(r))  was  assumed  to  be  fully  specified  for  all  v  by  the 
expert/data  analyst.  Spiegelhalter  and  Lauritzen  (1990a)  introduced  a  parametrisation  for 
pr(u|pa(u))  whereby  the  relationship  between  a  node  v  and  its  parents  pair)  is  fully  specified 
bv  6\,  G  @v.  This  leads  to  a  conditional  distribution  for  V': 

pr(V'|<?)  =  pr(i;|pa(u), 0V). 

v€V 

where  6  is  a  general  parameter  with  components  0V. 
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Spicgelhalter  and  Lauritzen  (1990a)  make  two  key  assumptions  wliirh  greatly  simplify 
subsequent  analysis.  The  first  assumption  is  that  of  global  indc ptndtnrt  whereby  the  pa¬ 
rameters  9V  are  assumed  mutually  independent  a  prior  i.  I  his  assumption  alone  allows  us  to 
calculate  the  likelihood  for  a  single  case: 

pr(i’)  =  I  pr  (v.O)dO  —  f  [7  pr(  Hpa(  r ).  0V  )pr{9,  )d0y  =  pr/  r  pai /•  i  i 
J  J  x  t 

where 

pr(t’jpa(r))  =  j  pT[>'\\ra{v),  0l.)])r[0v)d9v. 

1  he  second  assumption  is  that  of  local  indepcndcnct  whereby  the  parameter  0V  breaks  into 
components  corresponding  to  the  elements  of  the  state  space  of  pa(r)-  Ihese  components 
are  assumed  to  be  mutually  independent  a  priori. 

Now  consider  a  conditional  probability  distribution  pr(  r|pa(  r |  +  .  0*  j  =  O'  for  a  specific 

state  pa(e)+  of  pa(r).  We  assume  that  0 1+  has  a  Dirichlet  distribution  PjAJ" . where  k 

is  the  number  of  states  of  r.  Then  we  can  show  that 

pr(r|pa(r)  +  )  =  A  +  /^  A,+  . 

i 

If  we  observe  v  to  be  in  state  xVj  and  the  parent  state  to  be  pa(u)  +  .  we  have 

Ck-  T>\\i . a;  +  i . 

This  provides  a  straightforward  method  for  sequentially  calculating  the  ratios  of  posterior 
model  probabilities  required  for  the  model  selection  strategy.  The  elicitation  of  the  required 
Dirichlet  priors  is  feasible  provided  the  cardinality  of  pair)  is  not  too  large.  Computer- 
based  methods  for  eliciting  Dirichlet  prior  distributions  have  been  described  by  Chaloner 
and  Duncan  (1987).  If  pa(e)  is  not  observed  the  updating  becomes  more  complex — see 
Spiegelhalter  and  Lauritzen  ( 1990a, 1990b)  for  details. 

A  considerable  computational  saving  is  obtained  by  noting  that  the  sequential  updating 
of  the  distribution  of  0V  depends  on  the  states  of  v  and  pa(r)  only.  Therefore  the  likelihood 
for  all  qualitative  layers  (graphs)  having  the  same  set  pa(e)  of  parent  nodes  of  v  will  have 
identical  contributions  from  v.  For  example,  consider  the  two  recursive  causal  models  of 
Figure  3.  When  calculating  the  likelihood  for  the  model  of  Figure  3(a),  we  store  the  likeli 
hood  of  each  node/parent  combination  separately.  Now  when  subsequently  calculating  the 
likelihood  for  the  model  of  Figure  3(b),  only  the  likelihood  for  node  B  requires  recalculation 
as  the  sets  of  parent  nodes  of  A  and  C  have  not  changed. 
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To  implement  the  model  'election  strategy  describee  in  Section  2  for  the  recursive  causal 
models  an  ord^nng  of  the  nodes  must  be  pre  specified  by  the  expert /data  analyst  If  v, 
precedes  in  the  ordering,  then  a  directed  link  from  v}  to  r,  is  prohibited.  In  certain 
applications  it  may  be  possible  to  search  over  all  possible  orderings  but  this  will  typically 
not  be  the  case.  Pearl's  IC-algorithm  (Pearl  and  Verma,  1991)  induces  directed  “causal" 
structures  from  data.  An  ordering  of  the  nodes  is  not  required,  but  for  each  pair  of  nodes  r, 
and  Vj,  the  algorithm  does  involve  searching  amongst  all  subsets  of  V  -  for  cutsets 

between  v,  and  v}  (sets  which  when  conditioned  on,  render  e,  and  v}  independent.)  Cooper 
and  Herskovits  (1991)  provide  a  review  of  other  approaches. 


3.2  Examples 

3,2.1  Coronary  Heart  Disease  Risk  Factors 

Firstly  we  consider  a  data  set  which  has  been  previously  analysed  by  Edwards  and  Havranek 
(1985).  The  data  concerns  1,841  men  cross-classified  according  to  six  coronary  heart  disease 
risk  factors.  The  data  is  reproduced  in  Table  1.  The  risk  factors  are  as  follows:  A.  smoking; 
B ,  strenuous  mental  work;  C,  strenuous  physical  work;  D,  systolic  blood  pressure;  F.  ratio 
of  0  and  Ot  proteins;  F,  family  anamnesis  of  coronary  heart  disease. 


Their  likelihood  ratio-based  model  selection  strategy  selected  two  graphical  log-linear 
models:  [AC][AD E\[BC]\B E][F]  which  is  not  decomposable  and  therefore  is  not  equivalent 
to  any  recursive  causal  model,  and  \AC E)\AD E)[BC)\F)  which  is  decomposable.  A  striking 
feature  of  both  models  is  the  independence  of  F,  family  anamnesis.  The  models  are  shown 
in  Figure  4. 
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Table  1:  Risk  factors  for  Coronary  Heart  Disease 


F  E  D 

Negative  <3  <140 

>  140 

>3  <140 

>  140 

Positive  <3  <140 

>  140 

>3  <  140 

>  140 


U  No  Yes 


A 

No 

Y'es 

No 

Yes 

C 

No 

44 

40 

112 

67 

Y'cs 

129 

145 

12 

23 

No 

35 

12 

60 

33 

Yes 

109 

67 

i 

9 

No 

23 

32 

70 

66 

Yes 

50 

80 

7 

13 

No 

24 

25 

73 

57 

Y'es 

51 

63 

7 

16 

No 

5 

7 

21 

9 

Y'cs 

9 

17 

1 

4 

No 

4 

3 

11 

8 

Yes 

14 

17 

5 

•> 

*• 

No 

7 

3 

14 

14 

Yes 

9 

16 

2 

3 

No 

4 

0 

13 

11 

Yes 

5 

14 

4 

4 

(a)  (b) 

Figure  4:  Models  Selected  by  Edwards  and  Havranek 


Figure  5:  Coronary  Heart  Disease:  Recursive  Causal  Models  Selected 

To  implement  the  Bayesian  graphical  model  selection  procedure,  we  started  from  the 
saturated  model  and  used  the  “Down”  algorithm  only  (starting  from  the  empty  model  and 
using  the  “Up”  algorithm  produced  the  same  set  of  models).  All  qualitative  structures  were 
assumed  equally  likely  a  priori.  We  adopted  a  standard  Jeffreys  prior  density  throughout.  A 
natural  partial  ordering  of  the  variables  suggests  itself:  F,(B,C),  A,{E,  D).  B,ForC  could 
not  be  “influenced”  by  the  other  factors  and  must  be  exogenous,  although  the  ordering  of 
B  and  C  is  unclear.  Similarly,  D  or  E  could  hardly  influence  A,  although  the  ordering  of  E 
and  D  is  unclear.  The  four  corresponding  complete  orderings  produced  strong  evidence  for 
the  precedence  of  E  over  D,  and  indifference  as  to  the  ordering  of  B  and  C.  Several  further 
orderings  were  tried,  but  this  “natural”  ordering  resulted  in  the  models  with  highest  posterior 
probabilities.  The  selected  models  are  shown  in  Figure  5  and  their  posterior  probabilities  in 
Table  2. 

Two  models,  those  shown  in  Figures  5(a)  and  5(b),  account  for  most  of  the  posterior 
probability.  They  are  rather  similar  in  that  both  contain  the  C  —  B,  C  —  A,  A  —  E,  E  —  D 
and  A  —  D  links.  The  main  difference  between  them  lies  in  the  way  they  describe  the  effect  of 
strenuous  mental  work  ( B )  and  strenuous  physical  work  (C)  on  the  ratio  of  0  and  a  proteins 
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Table  2:  Coronary  Heart  Disease:  Posterior  Model  Probabilities  for  Recursive  Causal  Models 


Figure 

Log  marginal 
likelihood 

Posterior 
probability  % 

5(a) 

52 

5(b) 

5(c) 

-6725.4 

5 

5(d) 

-6725.7 

4 

( E ).  Model  5(a)  says  that  C  affects  E  both  directly  and  indirectly  via  A,  whereas  model 
5(b)  says  that  the  effect  of  C  on  E  is  solely  indirect,  being  mediated  by  D  and  A.  There 
is  also  some  uncertainty  about  the  presence  of  a  link  from  smoking  (A)  to  systolic  blood 
pressure  ( D ).  There  is  decisive  evidence  in  favour  of  the  marginal  independence  of  family 
anamnesis  of  coronary  heart  disease  (F). 

The  four  models  selected  correspond  very  closely  to  the  models  of  Figure  4  above.  We 
note  that  the  A  —  D  link  (smoking  and  systolic  blood  pressure)  is  present  in  both  of  the 
models  of  Figure  4  and  also  in  models  (a)  and  (b)  of  Figure  5,  but  it  is  absent  from  models  (c) 
and  (d)  of  Figure  5.  In  fact,  the  exact  test  for  zero  partial  association  of  A  and  D  reported 
by  Edwards  and  Havranek  (1985)  had  a  significance  level  of  0.04  which  was  the  largest  of 
any  of  the  links  accepted  at  the  5%  level. 

3.2.2  Women  and  Mathematics 

Our  second  example  concerns  a  survey  which  was  reported  in  Fowlkes  et  al.  (1988)  concerning 
the  attitudes  of  New  Jersey  high-school  students  towards  mathematics.  The  data  has  been 
further  analysed  by  Upton  (1991).  A  tots  of  1190  students  in  eight  schools  took  part  in  the 
survey.  Data  on  six  dichotomous  variables  was  collected: 

A.  Lecture  Attendance;  attended  or  did  not  attend; 

B.  Sex;  female  or  male; 

C.  School  Type;  suburban  or  urban; 

D.  “I’ll  need  mathematics  in  my  future  work”;  agree  or  disagree; 

E.  Subject  Preference;  maths/science  or  liberal  arts; 
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Figure  6:  Women  and  Mathematics:  Recursive  Causal  Model  Selected 
F.  Future  Plans;  college  or  job; 

Upton  (1991)  reports  that  a  model  selection  procedure  based  on  the  AIC  criterion 
(Akaike,  1973)  selects  [ABC E][C D F][BC D][D E F]  while  a  procedure  based  on  the  BIC  cri¬ 
terion  (Raftery,  1986a)  selects  the  much  simpler  [A\[B E][C E][C F][B D][D E][D F}.  Clearly 
an  important  difference  between  these  two  models  is  the  treatment  of  A. 

The  Bayesian  graphical  model  selection  procedure  started  from  the  empty  model  and 
used  the  “Up”  algorithm.  It  is  clear  that  B  (Sex)  cannot  be  influenced  by  other  variables 
and  must  be  exogenous.  Initially  it  .vas  also  assumed  that  C  (School  Type)  was  exogenous. 
An  exhaustive  search  over  all  consequent  orderings  produced  the  single  model  shown  in 
Figure  6. 

The  selected  model  is  similar  to  the  model  selected  by  Upton’s  BIC  procedure.  The 
model  selected  by  AIC  clearly  over-fits  the  data  (Upton,  1991).  It  is  of  interest  to  note  the 
direction  of  the  link  from  D  to  F.  Both  Upton  (1991)  and  Fowlkes  et  al.  (1988)  treat  D  as 
a  response  variable  and  Upton’s  path  diagram  shows  a  directed  link  from  F  to  D.  However, 
the  data  provides  strong  evidence  that  the  direction  of  the  influence  is  from  D  to  F,  i.e.  that 
students’  attitudes  towards  mathematics  influence  their  future  plans,  rather  than  the  other 
way  around.  The  ability  of  the  selected  model  to  predict  is  unaffected  by  the  direction  of 
the  E  —  D  link. 

Further  analysis  removed  the  restriction  that  C  be  exogenous.  The  data  now  provides 
some  support  for  the  presence  of  a  link  from  E  to  C  although  its  interpretation  is  somewhat 
unclear. 
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The  Undirected  Case — Decomposable  Model  Selec¬ 
tion 

4.1  Implementation 

To  implement  the  strategy  for  the  decomposable  models,  we  rely  heavily  on  a  recent  funda¬ 
mental  paper  by  Dawid  and  Lauritzen  (1989),  hereafter  DL.  We  consider  three  issues  which 
are  specific  to  model  selection  for  the  decomposable  models.  Firstly,  how  should  we  add 
and  remove  links  whilst  efficiently  ensuring  that  all  the  models  created  are  decomposable? 
Secondly,  given  any  two  decomposable  models  M  and  A/*,  is  it  possible  to  generate  A/*  from 
Af,  adding  or  removing  only  one  edge  at  a  time  but  staying  within  the  class  of  decomposable 
models?  Finally,  how  do  we  calculate  the  required  posterior  model  probabilities? 

The  first  two  issues  are  addressed  by  the  following  two  lemmas: 

LEMMA  1  Let  Q  —  (V,  E)  be  a  chordal  graph  with  vertices  V  and  edges  E  and  let  Q'  =  (V,  E ') 
be  a  chordal  subgraph  of  Q  with  exactly  one  edge,  e,  less.  Then  e  is  contained  in  exactly  one 
clique  of  Q. 

Proof.  This  follows  from  Lemma  3  of  Frydenberg  and  Lauritzen  (1989). 

Therefore  the  model  selection  strategy  must  remove  only  links  which  are  members  of  a 
single  clique.  When  adding  links,  the  strategy  must  not  create  any  chordless  four-cycles. 

LEMMA  2  Let  Q  =  {V,E)  and  Q'  =  (V,  E')  both  be  chordal  graphs  such  that  E'  C  E  and 
has  k  less  edges  than  Q .  Then  there  is  an  increasing  sequence  Q'  =  Go  C  ...  C  Gk  —  G  of 
chordal  graphs  that  differ  by  exactly  one  edge. 

Proof.  This  follows  from  Lemma  5  of  Frydenberg  and  Lauritzen  (1989). 

Now  we  address  the  calculation  of  the  posterior  model  probabilities.  Following  DL,  we 
consider  a  decomposable  model  M  for  a  set  of  random  variables  Xv ,  v  €  V.  Let  I  denote  the 
set  of  possible  configurations  of  X.  Let  the  quantitative  layer  of  M  be  specified  by  8.  Then 
the  distribution  of  8  is  determined  by  the  clique  marginal  probability  tables  8c  —  (^c)cec 


17 


where  C  denotes  the  set  of  cliques  of  M : 


9(i)  = 


Ucec  M»c)  ■  j 

fiscs  «s(«s)  ’ 


5  denotes  the  system  of  separators  in  an  arbitrary  perfect  ordering  of  C. 
For  each  clique  C  £  C,  let 


Ac  =  (Ac(ic))icerc 


be  a  given  table  of  arbitrary  positive  numbers  and  let  T>(\ c)  denote  the  Dirichlet  distribution 
for  9C  with  density 

t(»Ci*c)«  n  «c(ic)*c(ic)-1. 

*c€Xc 

where  £,c  9c(ic)  =  1  and  9(ic )  >  0. 

Now  let  us  suppose  that  the  collection  of  specifications  V(\c),C  £  C  are  constructed  in 
such  a  way  that  for  any  two  cliques  C  and  D  in  C  we  have: 


Ac(icnc)  -  Ao(lcn£>)- 


Then  DL  show  that  there  exists  a  unique  strong  hyper  Markov  distribution  for  6  over  M 
that  has  density  V{\ c)  for  all  C  £  C.  DL  call  this  the  hyper  Dirichlet  distribution  for  9. 
A  distribution  for  9  is  strong  hyper  Markov  if  and  only  if  9&\b,Qb\a  and  ^AnB  are  mutually 
independent  whenever  AC\  B  is  complete  and  separates  A  from  B.  It  follows  that  by  letting 
A0  =  H.gzAi,  the  likelihood  for  a  single  case  is  given  by: 


pr(v) 


Tlcec  Ac 
Ao(IIs€5  Ajr) 


From  Lemma  3  we  have  that  updating  can  be  carried  out  one  clique  at  a  time: 


LEMMA  3  If  the  prior  distribution  C(9)  is  strong  hyper  Markov,  the  posterior  distribution 
of  9  is  the  unique  hyper  Markov  distribution  C *  specified  by  the  clique-marginal  distributions 
{£c  :  C  £  C) ,  where  is  the  posterior  distribution  of  9c  based  on  its  prior  distribution  Cc 
and  the  clique-specific  data  Xc  =  xc- 


Proof.  This  is  Corollary  9  of  DL. 

The  posterior  distribution  for  9c  given  data  from  the  marginal  table  corresponding 
to  clique  C  is  2>(Ac  -f  nc)- 
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Figure  7:  Coronary  Heart  Disease:  Decomposable  Models  Selected 
Consider  the  Bayes  factor 

pr(D|M0) 

01  pr(D|Mj) 

where  M0  and  Mi  are  decomposable  and  Mo  is  obtained  from  Mi  by  deleting  one  edge  e 
linking  u  with  v.  From  Lemma  1  we  have  that  e  is  contained  in  a  single  clique,  C  say,  of  Mi. 
Let  Cu  =  C  —  {v},Cv  =  C  —  {u},C0  =  C  —  {u,t>}.  Then  DL  show  that  the  Bayes  factor  is 
given  by: 

B  =  Pc^Dc^Pc^Dc^ 

Pc0{Dc0)Pc{Dc) 

Thus,  the  required  decomposable  model  comparisons  can  be  carried  out  very  rapidly  with 
calculations  local  to  single  cliques. 


4.2  Examples 

4.2.1  Coronary  Heart  Disease  Risk  Factors 

Firstly  we  consider  again  the  coronary  heart  disease  risk  factor  data  of  Edwards  and  Havranek 
(1985)  which  is  shown  in  Table  1.  We  note  that  the  model  of  Figure  4(a)  which  was  selected 
by  the  Edwards  and  Havranek  procedure  is  not  decomposable  and  hence  will  not  be  selected 
by  our  procedure. 

The  selection  procedure  started  from  the  saturated  model  and  used  the  “Down”  algo¬ 
rithm.  All  qualitative  structures  were  assumed  equally  likely  a  priori.  A  rfTT>^-'rd  Je^^ys 
prior  was  adopted  for  6c,  C  €  C.  Just  two  models  were  selected  and  they  are  shown  in  Figure 
7.  Starting  from  the  empty  model  and  using  the  “Up”  algorithm  resulted  in  the  same  two 
models.  The  corresponding  posterior  probabilities  are  shown  in  Table  3. 
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Table  3:  Coronary  Heart  Disease:  Posterior  Model  Probabilities  for  Decomposable  Models 


Figure 

Model 

Log  marginal 
likelihood 

Posterior 
probability  % 

7(a) 

[BC)\ACE)[ADE)[F\ 

| 

92 

7(b) 

[. ABC){ABE}[ADE)[F ] 

8 

Table  4:  Women  and  Mathematics:  Posterior  Model  Probabilities  for  Decomposable  Models 


Figure 

Model 

Log  marginal 
likelihood 

Posterior 
probability  % 

mm 

[A}[BDE][CDF] 

WBMM 

75 

EH 

[A\[BDE}[DF\[CF\ 

flEHH 

25 

The  model  of  Figure  7(a)  was  also  selected  by  the  directed  model  selection  procedure  and 
by  Edwards  and  Havranek  (1985).  The  model  of  Figure  7(b)  is  essentially  a  decomposable 
version  of  the  directed  model  of  Figure  5(b)  and  Edwards  and  Havranek’s  model  of  Figure 
4(a).  It  is  interesting  to  note  that  a  model  identical  to  Figure  7(a)  except  for  the  A  -  D  link 
falls  just  outside  Occam’s  window  in  the  undirected  selection. 

Over  all,  the  model  selection  exercise  indicates  that  there  is  very  strong  evidence  for  the 
B  —  C,  A  —  C.  A  —  E  and  D  —  E  links,  with  evidence  for  the  A  —  D  link  that  is  strong 
but  somewhat  less  so.  There  is  also  some  evidence  for  the  C  —  E  and  B  —  E  links,  but  it 
seems  that  one  of  these  alone  is  enough  to  describe  the  data,  and  it  is  not  fully  clear  which 
one  is  better.  Again,  as  in  the  directed  case,  there  is  decisive  evidence  for  the  marginal 
independence  of  F . 

4.2.2  Women  and  Mathematics 

We  consider  again  the  survey  data  previously  analysed  by  Fowlkes  et  al.  (1988)  and  Upton 
(1991).  We  note  the  models  selected  in  Upton  (1991)  are  not  graphical  and  hence  will 
not  be  selected  by  our  procedure.  The  procedure  adopted  was  identical  to  that  adopted 
for  the  example  of  Section  4.9  1  The  two  models  selected  are  shown  in  Figure  8  and  the 
corresponding  posterior  probabilities  are  shown  in  Table  4. 

As  in  the  directed  case,  the  selected  models  are  close  to  the  models  selected  by  the  BIC 
model  selection  procedure  carried  out  by  Upton  (1991).  However  there  is  uncertainty  about 
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Figure  8:  Women  and  Mathematics:  Decomposable  Models  Selected 

the  C  —  D  link  (School  Type  and  “I’ll  need  mathematics  in  my  future  work”)  which  is  not 
apparent  in  Upton’s  analysis.  The  data  strongly  supports  the  marginal  independence  of  A. 

4.2.3  Scrotal  Swellings 

Our  final  example  concerns  the  diagnosis  of  scrotal  swellings.  Data  on  299  patients  was 
gathered  at  the  Meath  Hospital,  Dublin,  Ireland  under  the  supervision  of  Mr.  M.R.  Butler. 
We  consider  a  cross-classification  of  the  patients  according  to  one  lisease  class,  Hernia  (i/), 
and  7  binary  indicants  as  follows:  A ,  possible  to  get  above  the  swelling;  B,  swelling  tran- 
silluminates;  C,  swelling  separate  from  testes;  D ,  positive  valsalva/stand  test;  E,  tender; 
F,  pain;  <7,  evidence  of  other  urinary  tract  infections.  The  data  is  reproduced  in  Table 
5.  There  are  28  possible  links  to  be  considered  by  the  selection  procedure  in  this  example. 
In  the  absence  of  prior  expert  opinion,  computation  times  can  be  prohibitive.  Clearly,  if 
the  starting  point  for  the  selection  procedure  were  close  to  the  models  for  which  the  data 
provides  evidence,  this  problem  could  be  overcome.  With  this  objective  we  adopted  the 
following  heuristic  procedure:  firstly,  Bayes  factors  for  each  of  the  28  links  are  calculated  by 
comparing  the  saturated  model  with  the  28  sub-models  generated  by  removing  single  links. 
Links  for  which  the  data  provides  evidence  in  this  manner  are  then  used  as  a  starting  point 
for  the  selection  procedure  (if  the  model  thus  constructed  is  not  decomposable,  some  of  the 
links  may  be  removed  or  additional  ones  may  be  added.)  The  starting  model  is  shown  in 
Figure  9. 

Now  the  “Up”  algorithm  is  executed,  followed  by  the  “Down”  algorithm  (or  vice  versa). 
Note  that  if  the  starting  links  are  badly  chosen,  the  complete  procedure  has  the  opportunity 
to  remove  them,  although  in  this  example,  the  final  model  contains  all  the  links  from  the 
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Table  5:  Scrotal  Swelling  data 


Hernia 

A 

B 

Indicants 

C  D  E 

F 

G 

Count 

N 

N 

N 

N 

N 

Y 

Y 

N 

1 

N 

Y 

N 

N 

N 

N 

N 

N 

16 

N 

Y 

N 

N 

N 

N 

Y 

N 

3 

N 

Y 

N 

N 

N 

Y 

Y 

N 

51 

N 

Y 

N 

N 

N 

Y 

Y 

Y 

17 

N 

Y 

N 

Y 

N 

N 

N 

N 

30 

N 

Y 

N 

Y 

N 

N 

N 

Y 

1 

N 

Y 

N 

Y 

N 

N 

Y 

N 

3 

N 

Y 

N 

Y 

N 

Y 

N 

N 

1 

N 

Y 

N 

Y 

N 

Y 

Y 

N 

20 

N 

Y 

N 

Y 

N 

Y 

Y 

Y 

4 

N 

Y 

N 

Y 

Y 

N 

N 

N 

36 

N 

Y 

N 

Y 

Y 

N 

Y 

N 

3 

N 

Y 

Y 

N 

N 

N 

N 

N 

38 

N 

Y 

Y 

N 

N 

N 

N 

Y 

1 

N 

Y 

Y 

N 

N 

N 

Y 

N 

3 

N 

Y 

Y 

N 

N 

Y 

Y 

N 

3 

N 

Y 

Y 

Y 

N 

N 

N 

N 

21 

N 

Y 

Y 

Y 

N 

Y 

Y 

N 

2 

Y 

N 

N 

Y 

Y 

N 

N 

N 

39 

Y 

N 

N 

Y 

Y 

N 

Y 

N 

5 

Y 

Y 

N 

Y 

Y 

N 

N 

N 

1 

Figure  9:  Starting  Model  for  Scrotal  Swelling  Example 
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Figure  10:  Scrotal  Swellings:  Decomposable  Models  Selected 

starting  model.  Two  models  were  selected  by  this  procedure  and  they  are  shown  in  Figure 
10.  The  corresponding  posterior  probabilities  are  shown  in  Table  6. 

The  result  of  primary  interest  here  is  the  importance  of  A  (possible  to  get  above  swelling) 
and  D  (valsalva/stand  test)  with  respect  to  Hernia  diagnosis.  Both  indicants  can  be  estab¬ 
lished  through  simple  procedures  at  physical  examination.  The  only  real  model  uncertainty 
which  is  exhibited  concerns  the  relationship  between  C  (swelling  separate  from  testes)  and 
E  (tender).  Analysis  of  further  cross-classifications  extracted  from  this  database  also  yield 
similarly  sparse  models. 

Table  6:  Scrotal  Swellings:  Posterior  Model  Probabilities  for  Decomposable  Models 


Figure 

Model 

Log  marginal 
likelihood 

Posterior 
probability  % 

mm 

[AH\[DH\[BDE\[C  DE][EF][EG] 

75 

BH 

[. AH][DH][BDE}[CD][EF]{EG] 

25 
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Figure  11:  Over-parametrisation  of  the  Dirichlet  Prior 

5  Expert  Systems 

As  discussed  in  the  introduction,  model  updating  for  expert  system  applications  can  be 
carried  out  in  either  the  recursive  causal  model  framework  or  the  undirected  decomposable 
framework.  However,  while  the  decomposable  model  selection  strategy  of  Section  4  can  be 
efficiently  implemented  and  avoids  the  computational  problems  associated  with  orderings 
in  the  directed  models,  direct  elicitation  of  the  required  Dirichlet  distributions,  P(AC),  will 
typically  be  intractable.  In  this  section  we  outline  a  simple  procedure  for  eliciting  the  required 
priors.  This  had  previously  been  a  barrier  to  the  application  of  the  approach  of  DL.  We  also 
address  briefly  the  use  of  prior  model  probabilities  to  control  the  search  in  very  large  model 
spaces. 


5.1  Elicitation  for  Decomposable  Model  Selection 

Our  objective  is  to  use  the  initial  recursive  causal  model  as  a  framework  in  which  to  elicit  the 
quantitative  layer  and  ensure  that  the  resultant  prior  distribution  is  hyper  Dirichlet.  We  note 
that  for  both  the  recursive  causal  and  the  decomposable  model  selection  strategies,  the  hyper 
Dirichlet  framework  provides  a  straightforward  mechanism  for  evolving  prior  distributions 
as  links  are  added  and  deleted. 

Consider  a  simple  recursive  causal  model  with  two  binary  variables  as  shown  in  Figure 
11(a).  This  model  is  decomposable  and  the  corresponding  undirected  representation  is  given 
in  Figure  11(b). 

Using  the  approach  outlined  in  Section  3,  the  model  of  Figure  11(a)  would  require  the 
elicitation  of  prior  Dirichlet  distributions  for  pr(BjA), pr(£?|A)  and  pr(A)  with  a  total  of  6 
parameters.  The  model  of  Figure  11(b)  however  has  only  one  clique  {A,  B }  and  its  prior  hy¬ 
per  Dirichlet  distribution  has  just  4  parameters.  In  general,  the  Dirichlet  prior  representation 
considered  in  Section  3  is  over-parametrised  relative  tu  the  hyper  Dirichlet  distribution. 


Figure  12:  Lauritzen  and  Spiegelhalter’s  Dyspnoea  Example 

Our  approach  is  to  regard  the  parameters  A„|pa<v)  as  “equivalent  prior  samples"  which  are 
elicited  subject  to  constraints  which  ensure  consistency.  The  implied  equivalent  prior  sample 
for  A  satisfies  the  independence  relationships  of  the  recursive  causal  model  and  specifies  the 
complete  hyper  Dirichlet  distribution.  Marginalising  onto  each  clique  C  €  C  provides  the 
clique  prior  distributions. 

Thus  for  the  example  of  Figure  11,  the  equivalent  prior  sample  is  an  unconstrained 
two-by-two  table.  Having  elicited  a  Dirichlet  prior  for  pr(A),  there  are  only  two  degrees 
of  freedom  remaining  for  the  prior  distributions  of  pr(B\A)  and  pr(J9|/4).  Effectively  the 
size  of  the  equivalent  prior  sample  and  its  breakdown  between  A  and  ,4  is  fixed  by  the 
elicitation  of  the  prior  for  pr(/4),  and  only  one  further  parameter  is  now  required  for  each  of 
pr(  B\ A)  and  pr(  B\A).  We  envisage  software  which  would  display  all  three  prior  distributions 
simultaneously,  and  maintain  the  required  constraints  interactively. 

Since  for  all  v  G  V,  {u  U  pa(e)}  C  C  for  some  C  €  C,  implementation  of  the  procedure 
in  more  general  models  is  straightforward.  We  consider  a  fictitious  example  previously 
considered  by  Lauritzen  and  Spiegelhalter  (1988)  and  provide  a  detailed  description  of  the 
elicitation  procedure.  The  (elicited)  recursive  causal  model  is  shown  in  Figure  12. 

Elicitation  of  the  hyper  Dirichlet  prior  and  the  marginal  prior  distributions  for  the  cliques 
of  the  corresponding  decomposable  model  proceeds  as  follows: 

1.  Elicit  distributions  for  pr(a),  pr(r|a). 

Degrees  of  freedom  :  4 
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2.  Elicit  distribution  for  pr(t|r,  A).  The  marginal  distribution  of  prt  r  )  has  already  been 
specified  in  1.  Furthermore  the  model  implies  rlA 

Degrees  of  freedom  :  5 

3.  Elicit  distribution  for  pr(£|t).  The  marginal  distribution  of  pr(cj  hits  already  been 
specified  in  2. 

Degrees  of  Freedom  :  2 

4.  Elicit  distribution  for  pr(A|<y).  The  marginal  distribution  of  pr(A)  has  already  been 
specified  in  2. 

Degrees  of  Freedom  :  2 

5.  Elicit  distribution  for  pr(J|o).  The  marginal  distribution  of  pr(<r)  has  already  ln*en 
specified  in  4. 

Degrees  of  Freedom  :  2 


5a.  Since  the  model  implies  M.Jjcr.  we  have  that: 


pr(  A,  $,(t) 


pr(A.rr)pr(  3. a) 


and  hence  we  derive  the  distribution  of  pr(A,pbrr)  and  the  marginal  distribution  of 
pr(A,  0). 


5b.  Since  the  model  implies  cJ_J|A,  we  have  that: 


pr(c,  l3,\) 


pr(c.  A)pr( 3,  A) 
Pr(  A) 


and  hence  we  derive  the  distribution  of  pr(f,:?,A)  and  the  marginal  distribution  of 

pr(c, 


6.  Elicit  distribution  for  pr(<5|t,/?). 
Degrees  of  Freedom  :  4 
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5.2  Model  Priors 


In  the  examples  considered  above,  the  prior  model  probabilities  pr(A/)  were  assumed  equal 
(Cooper  and  Herskovits,  1991,  also  assume  that  models  are  equally  likely  a  prxon).  In  general 
this  can  be  unrealistic  and  may  also  be  expensive  and  we  will  want  to  penalise  the  search 
strategy  as  it  moves  further  away  from  the  model(s)  provided  by  the  expert(s).  Ideally  one 
would  elicit  prior  probabilities  for  all  possible  qualitative  structures  from  the  expert  but  this 
will  be  feasible  only  in  trivial  cases. 

For  models  with  fewer  than  15  to  20  nodes,  prior  model  probabilities  may  be  approxi¬ 
mated  by  eliciting  prior  probabilities  for  the  presence  of  every  possible  link  and  assuming 
that  the  links  are  mutually  independent,  as  follows:  Let  £  =  Ep  U  Ea  denote  the  set  of  all 
possible  links  for  the  nodes  of  model  A/,  where  Ep  denotes  the  set  of  links  which  are  present 
in  model  M  and  Ea  denotes  the  absent  links.  For  every  link  e  €  E  we  elicit  pr(e),  the  prior 
probability  that  link  e  is  included  in  A/.  The  prior  model  probability  is  then  approximated 
by 

pr(Af)  a  Y[  pr(e)  ]\{\  -  pr(e)). 

Prior  link  probabilities  from  multiple  experts  are  treated  as  independent  sources  of  informa¬ 
tion  and  are  simply  multiplied  together  to  give  pooled  prior  model  probabilities.  Clearly, 
the  contribution  from  each  expert  could  be  weighted. 

For  applications  involving  a  larger  number  of  nodes  or  where  the  elicitation  of  link  prob¬ 
abilities  is  not  possible,  we  could  assume  that  the  ‘‘evidence”  in  favour  of  each  link  included 
by  the  expert(s)  in  the  elicited  qualitative  structure(s)  is  “substantial”  or  “strong”  but  not 
“very  strong”  or  “decisive”  (Jeffreys,  1961).  For  example,  we  could  assume  that  the  evidence 
in  favour  of  an  included  link  lies  at  the  center  of  Occam's  window  corresponding  to  a  prior 
link  probability  for  all  e  €  Ep  of 

Pr(e)=  l+exp(2S2ly' 

Similarly,  the  prior  link  probabilities  for  e  €  Ea  are  given  by 


pr(e)  = 


exp( 


Ol+Qr ) 
_ 2  _ > 


1  +  exp(— a) 


In  the  directed  case  it  may  be  possible  to  construct  a  prior  distribution  on  the  space  of 
orderings — see  Critchlow  (1985)  for  further  discussion. 
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6  Discussion 


We  have  outlined  an  overall  strategy  for  model  selection  and  accounting  for  model  uncertainty 
in  two  important  classes  of  models  for  high-dimensional  contingency  tables.  This  involves  a 
redefinition  of  the  Bayesian  model  uncertainty  formalism,  an  efficient  way  of  computing  exact 
Bayes  factors  that  exploits  the  graphical  structure,  and  an  algorithm  for  quickly  searching 
through  the  very  large  model  classes  involved.  The  resulting  procedure  is  quite  efficient: 
for  the  example  of  Section  4.2.1,  approximately  3,000  model  comparisons  per  minute  can  be 
carried  out  on  a  Sun  IPC. 

There  is  a  considerable  literature  on  model  selection  for  multidimensional  contingency 
tables;  this  is  generally  concerned  with  the  selection  of  a  single  “best”  model.  Most  of  it  is 
based  on  the  asymptotic  properties  of  goodness-of-fit  statistics  (Wermuth  (1976),  Havranek 
(1984),  Whittaker  (1984),  Edwards  and  Havranek  (1985)  or  Fowlkes  et  al.  (1988)).  There 
are  also  approaches  based  on  information  criteria  and  discrepancy  measures  (Gokhale  and 
Kullback,  1978;  Sakamoto,  1984;  Linhart  and  Zucchini,  1986).  A  recent  review  is  provided 
by  Upton  (1991)  who  advocates  the  use  of  the  BIC  statistic.  The  calculation  of  Bayes  factors 
for  contingency  table  models  has  been  considered  by  Spiegelhalter  and  Smith  (1982),  Raftery 
(1986a,  1988),  Spiegelhalter  and  Lauritzen  (1990a)  and  Spiegelhalter  and  Cowell  (1991). 

Pearl  and  Verma  (1991)  and  Glymour  et  al.  (1987)  have  proposed  str.'-t*>gies  for  recovering 
causal  structure  from  data.  While  these  authors’  objectives  differ  from  ours,  their  procedures 
for  selecting  directed  graphical  structures  have  much  in  common  with  our  recursive  causal 
model  selection  strategy. 

Cooper  and  Herskovits  (1991)  and  Anderson  et  al.  (1991)  have  examined  model  selection 
in  the  context  of  probabilistic  expert  systems.  In  both  cases,  model  selection  is  based  solely 
on  data  analysis  and  the  incorporatior  of  prior  expert  opinion  is  not  considered.  Cooper 
and  Herskovits  (1991)  outline  a  Baye.  \n  strategy  which  seeks  out  the  “best”  recursive 
causal  model  for  the  qualitative  layer,  w  ?re  “best”  is  taken  to  mean  the  single  model  with 
maximum  probability.  The  algorithm  s  with  a  model  with  no  links  and  at  each  stage 
adds  the  directed  link  which  most  increa  he  model  probability.  The  user  must  pre-specify 
an  ordering  of  the  nodes.  Anderson  et  .  1991)  carry  out  their  search  in  the  undirected 

graphical  model  framework  using  a  meth  troduced  by  Kreiner  (1987).  The  difficulties 
with  large  sparse  tables  mentioned  above  .  voided  by  using  exact  tests  when  comparing 
models. 

While  we  believe  that  the  methods  we  pr  ie  provide  a  workable  approach  to  qualita- 
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tive  updating  in  expert  systems,  some  issues  remain.  Spiegelhalter  and  Lauritzen  (1990a) 
and  other  authors  have  expressed  concerns  about  automatically  updating  the  qualitative 
structure  without  reference  to  the  domain  expert.  Such  concerns  need  to  be  addressed  in 
the  context  of  real  expert  systems.  Extension  of  the  methods  to  include  the  more  general 
graphical  models  of  Wermuth  and  Lauritzen  (1990)  and  Edwards  (1990)  will  also  be  im¬ 
portant  for  expert  system  applications.  Missing  data  will  frequently  be  a  problem  and  we 
are  currently  exploring  a  number  of  techniques  for  the  incorporation  of  missing  data  in  the 
model  selection  strategy. 
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We  consider  the  problem  of  model  selection  and  accounting  for  model  uncertainty  in  high¬ 
dimensional  contingency  tables,  motivated  by  expert  system  applications.  The  approach 
most  used  currently  is  a  stepwise  strategy  guided  by  tests  based  on  approximate  asymptotic 
P-values  leading  to  the  selection  of  a  single  model;  inference  is  then  conditional  on  the 
selected  model.  The  sampling  properties  of  such  a  strategy  are  complex,  and  the  failure  to 
take  account  of  model  uncertainty  leads  to  underestimation  of  uncertainty  about  quantities 
of  interest.  In  principle,  a  panacea  is  provided  by  the  standard  Bayesian  formalism  which 
averages  the  posterior  distributions  of  the  quantity  of  interest  under  each  of  the  models, 
weighted  by  their  posterior  model  probabilities.  However,  this  has  not  been  used  in  practice 
because  computing  the  posterior  model  probabilities  is  hard  and  the  number  of  models  is 
very  large  (often  greater  than  10n). 

We  argue  that  the  standard  Bayesian  formalism  is  unsatisfactory  and  we  propose  an 
alternative  Bayesian  approach  that,  we  contend,  takes  full  account  of  the  true  model  un¬ 
certainty  by  averaging  over  a  much  smaller  set  of  models.  An  efficient  search  algorithm  is 
developed  for  finding  these  models.  We  consider  two  classes  of  models  that  arise  in  expert 
systems:  the  recursive  causal  models  and  the  decomposable  log-linear  models.  For  each  of 
these,  we  develop  efficient  ways  of  computing  exact  Bayes  factors  and  hence  posterior  model 
probabilities.  For  the  decomposable  log-linear  models,  this  is  based  on  properties  of  chordal 
graphs  and  hyper  Markov  prior  distributions  and  the  resultant  calculations  can  be  carried 
out  locally.  The  end  product  is  an  overall  strategy  for  model  selection  and  accounting  for 
model  uncertainty  that  searches  efficiently  through  the  very  large  classes  of  models  involved. 

Three  examples  are  given.  The  first  two  concern  data  sets  which  have  been  analysed  by 
several  authors  in  the  context  of  model  selection.  The  third  addresses  a  urological  diagnostic 
problem. 

KEYWORDS:  Chordal  graph;  Contingency  table;  Decomposable  log-linear  model;  Expert 
system;  Hyper  Markov  distribution;  Recursive  causal  model. 


